Enable PP and EP overlap for MoE #1721
base: main
Conversation
Force-pushed from 3a61b86 to 0f7a7c9
Running with: CUDA_LAUNCH_BLOCKING
Force-pushed from 0f7a7c9 to 6584aac
Force-pushed from a6e46c7 to 5810c54
Just landed pytorch/pytorch#162016, so once CI picks up the nightly the errors should be fixed.
Looks very cool! Left some comments and questions.
Also looking forward to benchmarking results with overlapping enabled vs. disabled. In particular, for the 16B model, we should be able to test out on 8 GPUs, assuming SAC is composable.
```diff
 [activation_checkpoint]
-mode = "selective"  # ["none", "selective", "full"]
+mode = "none"  # ["none", "selective", "full"]
```
Does it not support SAC?
AC/SAC are not supported since we split the backward into two parts.
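For context, here is a minimal sketch of one way a backward pass can be split into two parts: stop at a boundary activation, then resume later from the saved gradient. This is an illustrative reconstruction under that assumption, not the PR's actual code; the modules and the choice of boundary are made up:

```python
# Illustrative only: split one backward pass into two stages by detaching at a
# boundary activation and resuming from its gradient afterwards.
import torch
import torch.nn as nn

stage1 = nn.Linear(16, 16)  # hypothetical compute before the boundary
stage2 = nn.Linear(16, 16)  # hypothetical compute after the boundary

x = torch.randn(4, 16, requires_grad=True)
h = stage1(x)                                 # boundary activation
h_detached = h.detach().requires_grad_(True)  # cut the graph at the boundary
loss = stage2(h_detached).sum()

# Part 1: backward through the second half only, stopping at the boundary.
loss.backward()
grad_at_boundary = h_detached.grad

# Part 2: resume backward through the first half from the boundary gradient.
h.backward(grad_at_boundary)
assert x.grad is not None
```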
```diff
     mscale=0.70,
-    use_flex_attn=True,
-    attn_mask_type="block_causal",
+    use_flex_attn=False,
```
Is FlexAttention not supported? It sounds unrelated.
Force-pushed from 9e43a67 to 7cf98e4
Fixed one issue with FSDP last reshard not being called. Rest is mostly refactoring, changing some variables to be class variables so they can be used in pytorch/torchtitan#1721. Pull Request resolved: pytorch#165513. Approved by: https://github.com/fegin
Force-pushed from 7cf98e4 to c29fa82
Option 2 of #1682
These changes add a custom `overlap_callback` function to replace the OVERLAP_F_B action that is run during schedule execution. In the custom function, we write `run_forward()` and `run_backward()`. `run_backward()` is run on a separate thread so that we can have both forward and backward running together side by side (a rough sketch is shown further below). Looks like this:

(figure: forward and backward of different microbatches overlapping)

In order for these changes to work with Expert Parallel, we also need to add custom autograd functions to act as the boundary points at which we do communication. We added hooks before and after expert parallel dispatch and combine to signal boundary points, so our figure from before now turns into:

(figure: the same overlap with EP dispatch/combine boundary points marked)
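A minimal sketch of the thread-based overlap described above, assuming the schedule hands the callback one forward and one backward action per OVERLAP_F_B step. The function and argument names here are illustrative stand-ins, not the PR's actual API:

```python
# Hypothetical sketch: run the backward action on a worker thread while the
# forward action runs on the main thread, so their compute and EP
# communication can be interleaved.
import threading


def run_forward(action):
    # Stand-in for the schedule's forward work for one microbatch
    # (attention/MLP compute plus EP dispatch/combine).
    ...


def run_backward(action):
    # Stand-in for the backward work of an earlier microbatch.
    ...


def overlap_callback(forward_action, backward_action):
    bwd_thread = threading.Thread(target=run_backward, args=(backward_action,))
    bwd_thread.start()           # backward proceeds on a side thread
    run_forward(forward_action)  # forward proceeds on the main thread
    bwd_thread.join()            # rejoin before the next schedule action
```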
Now in each of these red blocks, we use a global coordinator. We need `threading.Barrier(2).wait()` so that the comm and compute from our forward and backward steps are scheduled in lock-step before continuing.
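A rough sketch of what such a barrier-guarded boundary point could look like as a custom autograd function. The class name `SyncBoundary`, the `mark_boundary` helper, and the exact placement are assumptions for illustration, not the PR's implementation:

```python
# Hypothetical sketch: a custom autograd Function that makes the forward
# thread and the backward thread rendezvous at an EP dispatch/combine
# boundary, so their comm and compute are scheduled in lock-step.
# Note: Barrier(2) only releases once both threads have reached wait().
import threading

import torch

_coordinator = threading.Barrier(2)  # one forward thread + one backward thread


class SyncBoundary(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        _coordinator.wait()  # block until the peer thread hits its boundary
        return x

    @staticmethod
    def backward(ctx, grad_output):
        _coordinator.wait()  # same rendezvous on the backward path
        return grad_output


def mark_boundary(x):
    # Would be inserted before/after expert parallel dispatch and combine.
    return SyncBoundary.apply(x)
```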
DSv3 16B run command:
Trace examples: